Red Wine Exploration by Marvin Lüthe

> Red Wine Dataset: This dataset holds information on 1599 red variants of the Portuguese “Vinho Verde” wine. The inputs are objective tests (e.g. pH, fixed acidity, residual sugar) and the output is based on sensory data (wine quality between 0 and 10), graded by experts. We will use this dataset to find out how the input variables are related to the quality of the wine. First, we will visualize the input variables to understand how they are distributed. Then we will move on to a bivariate analysis to see how the input variables are correlated with each other and which of them are correlated with the wine quality. Furthermore, we will draw multivariate plots to combine different variables and visualize their relation to the wine quality. Finally, we will use this knowledge to set up a model that predicts the quality of a wine from the given input variables.

> Dataset Variables:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

> Dataset Summary:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Univariate Plots Section

> Quality: This bar chart shows the distribution of quality. Most of the wines were rated 5 or 6 points out of 10 by the experts.

> Alcohol: The distribution of alcohol has a slight positive skew.

> Volatile Acidity: The volatile acidity is distributed around the mean of 0.53 g/dm^3.

> Sulphates: There are a few outliers in the sulphates distribution. We should keep this in mind for further exploration.

> Fixed Acidity: The fixed acidity is distributed around the mean of 8.32 g/dm^3.

> Citric Acid: The distribution for citric acid appears roughly uniform. However, we can clearly see some unusual values. 132 wines do not contain any citric acid. Another 68 wines have a citric acid content of 0.49 g/dm^3. One wine has a citric acid content of 1 g/dm^3.

## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

> Residual Sugar: The distribution of residual sugar is concentrated on the left side, with a few outliers on the right side of the plot. This may also be interesting for further exploration.

> Chlorides: The chlorides distribution is concentrated on the left side of the plot as well.

> Free Sulfur Dioxide: The free sulfur dioxide variable has a strong positive skew on a linear axis. This is why I decided to log-transform the x-axis. After the transformation, the distribution resembles a normal distribution.

> Total Sulfur Dioxide: The total sulfur dioxide distribution has been log-transformed as well.
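The effect of the log transformation can be sketched in base R. The plotting code behind this report is not shown, so the snippet below is only an illustration: `tsd` is simulated right-skewed data (not the wine measurements), and `skewness` is a small helper defined here for the comparison.

```r
# Simulated stand-in for a right-skewed variable such as total sulfur dioxide.
# ASSUMPTION: log-normal data with a median of 38, chosen for illustration only.
set.seed(1)
tsd <- rlnorm(1599, meanlog = log(38), sdlog = 0.7)

# Moment-based sample skewness (hypothetical helper, not part of base R)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

raw_skew <- skewness(tsd)         # clearly positive on the linear scale
log_skew <- skewness(log10(tsd))  # close to zero after the log transform

print(round(c(raw = raw_skew, log10 = log_skew), 2))

# Histogram counts on the log10 scale, computed without opening a plot device
log_hist <- hist(log10(tsd), breaks = 30, plot = FALSE)
```

With ggplot2, the same idea is usually expressed by adding `scale_x_log10()` to a histogram, which transforms the axis rather than the data.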

> Density: The density curve is distributed around its mean of 0.997 g/cm^3.

> pH: The average pH value for the red wines in our dataset is 3.3. The pH values are normally distributed.

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wines in the dataset with 11 input variables and one output variable. All variables are numerical.

Input variables:
- fixed acidity (tartaric acid - g / dm^3)
- volatile acidity (acetic acid - g / dm^3)
- citric acid (g / dm^3)
- residual sugar (g / dm^3)
- chlorides (sodium chloride - g / dm^3)
- free sulfur dioxide (mg / dm^3)
- total sulfur dioxide (mg / dm^3)
- density (g / cm^3)
- pH
- sulphates (potassium sulphate - g / dm^3)
- alcohol (% by volume)

Output variable:
- quality (score between 0 and 10)

What is/are the main feature(s) of interest in your dataset?

The main feature in the dataset is the wine quality. I suspect that there are correlations between the input variables and the wine quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Every input variable is potentially of interest for the exploration. However, I expect alcohol to be particularly interesting, since the alcoholic strength might have had an impact on the experts' ratings. :)

Did you create any new variables from existing variables in the dataset?

I did not create any new variables.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form  of the data? If so, why did you do this?

I log-transformed the right-skewed free sulfur dioxide and total sulfur dioxide distributions. The transformed distributions appear approximately normal, with free sulfur dioxide peaking around 10 mg / dm^3 and total sulfur dioxide peaking around 40 mg / dm^3.

Moreover, the distributions for residual sugar, chlorides and sulphates are concentrated at low values, with only a few high-valued outliers. I was also surprised by the quality ratings, as I expected a wider spread (an interquartile range of 3-8 instead of the observed 5-6).

Bivariate Plots Section

The following matrices show the correlations between the variables:

> Approach: First, we will look at the variables that are most strongly correlated with the quality of the wines. The top 3 features are:
- alcohol (correlation: 0.476)
- volatile acidity (correlation: -0.391)
- sulphates (correlation: 0.251)

Then we will look at the strongest correlations among the input variables:
- citric acid vs fixed acidity (correlation: 0.672)
- pH vs fixed acidity (correlation: -0.683)
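A ranking like the one above can be obtained from a correlation matrix. The following sketch uses only the first ten rows printed in the dataset structure at the top of this report, so its coefficients will differ from the full-data values quoted here; the names `wines`, `cors` and `ranked` are illustrative.

```r
# First ten observations as printed in the str() output above
wines <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of every input with quality
cors <- cor(wines)[, "quality"]
cors <- cors[names(cors) != "quality"]

# Rank by absolute strength so that strong negative correlations are not lost
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting by the absolute value matters here because volatile acidity is negatively correlated with quality and would otherwise end up at the bottom of the list.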

> Bivariate scatterplot quality vs alcohol: The wine quality is positively correlated with the alcohol strength (r = 0.476). The boxplots on the right illustrate a clear linear relationship for the quality ratings between 5 and 8.

> Bivariate scatterplot quality vs volatile acidity: Volatile acidity has the strongest negative correlation with the wine quality. Both the linear regression and the boxplots overlaying the scatterplots emphasize this relationship.

> Bivariate scatterplot quality vs sulphates: The sulphates variable has the third strongest correlation with the quality variable. Looking back at the univariate analysis of sulphates, we can now see that most of the outliers belong to wines with middle quality ratings of 5-6. This is not surprising, as most of the wines in this dataset lie in this range.

> Bivariate scatterplots fixed acidity vs citric acid and fixed acidity vs pH: On the left we can see the strong positive relationship between fixed acidity and citric acid. This makes sense, as citric acid is one of the fixed acids in wine, along with malic and tartaric acid.
On the right we see a strong negative correlation between fixed acidity and pH. This should not surprise us either: a lower pH value indicates a more acidic liquid and vice versa.

> Bivariate scatterplots fixed acidity vs density and alcohol vs density: On the left we see the positive correlation between fixed acidity and density. The scatterplot on the right shows the negative relationship between alcohol and density.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The wine quality is positively correlated with the alcohol strength (correlation: 0.476). Furthermore, the quality is negatively correlated with the volatile acidity (correlation: -0.391). The third most important driver of the wine quality is the sulphates parameter (correlation: 0.251).

The strongest relationships among the input variables were between fixed acidity and citric acid and between fixed acidity and pH. These patterns are consistent with basic chemistry.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Two strong relationships that I would not have expected are:
- fixed acidity and density (correlation: 0.668)
- density and alcohol (correlation: -0.496)

What was the strongest relationship you found?

The strongest correlation was between fixed acidity and pH (correlation: -0.683).

Multivariate Plots Section

> Wine quality by alcohol and volatile acidity: We have introduced a new grouping of the quality rating: Bad, Middle and Good. This makes it easier to identify patterns in the data.
The first multivariate plot shows the wine quality by alcohol and volatile acidity. These variables have the strongest correlations with the wine quality. We can see that the wine quality increases towards the upper left corner of the scatterplot.
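Grouping the integer ratings into three levels can be sketched with `cut()`. The report does not state the cut points, so the breaks below (up to 4 is Bad, 5-6 is Middle, 7 and above is Good) are an assumption; the `quality` vector holds the first ten ratings printed in the dataset structure above.

```r
# First ten quality ratings as printed in the str() output above
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# ASSUMPTION: the cut points are illustrative, the original ones are not given
rating <- cut(
  quality,
  breaks = c(0, 4, 6, 10),          # intervals [0,4], (4,6], (6,10]
  labels = c("Bad", "Middle", "Good"),
  include.lowest = TRUE,            # keep a rating of 0 inside the first bin
  ordered_result = TRUE             # Bad < Middle < Good
)

print(table(rating))
```

An ordered factor keeps the legend and any colour scale in the natural Bad, Middle, Good order when the variable is mapped to colour in a scatterplot.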

> Wine quality by alcohol and sulphates: Here we can see how the wine quality changes for different alcohol strengths and sulphates inputs. The better wines seem to have input values in the upper right corner of the scatterplot.

> Wine quality by volatile acidity and sulphates: This scatterplot once again shows the negative correlation between volatile acidity and wine quality. Wines with low sulphates values and a high volatile acidity seem to get bad ratings.

> Wine quality by citric acid and fixed acidity: In this plot it is not possible to determine a clear pattern.

> Wine quality by pH and fixed acidity: This multivariate plot does not show a clear pattern either.

> Wine quality by density and fixed acidity: This plot shows a strong positive correlation between fixed acidity and density. For a given fixed acidity, it makes sense to aim for a low density.

> Wine quality by density and alcohol: This plot suggests that the alcohol strength should be at least 10%, as the wines on the right side of the plot are rated better than the rest.

> Prediction Model:
- output variable: quality
- input variables: alcohol, volatile acidity, sulphates, total sulfur dioxide, citric acid

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = df)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = df)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = df)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     total.sulfur.dioxide, data = df)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     total.sulfur.dioxide + citric.acid, data = df)
## 
## ==============================================================================================
##                              m1            m2            m3            m4            m5       
## ----------------------------------------------------------------------------------------------
##   (Intercept)               1.875***      3.095***      2.611***      2.826***      2.843***  
##                            (0.175)       (0.184)       (0.196)       (0.201)       (0.205)    
##   alcohol                   0.361***      0.314***      0.309***      0.295***      0.295***  
##                            (0.017)       (0.016)       (0.016)       (0.016)       (0.016)    
##   volatile.acidity                       -1.384***     -1.221***     -1.199***     -1.222***  
##                                          (0.095)       (0.097)       (0.097)       (0.112)    
##   sulphates                                             0.679***      0.712***      0.721***  
##                                                        (0.101)       (0.101)       (0.103)    
##   total.sulfur.dioxide                                               -0.002***     -0.002***  
##                                                                      (0.001)       (0.001)    
##   citric.acid                                                                      -0.043     
##                                                                                    (0.104)    
## ----------------------------------------------------------------------------------------------
##   R-squared                 0.227         0.317         0.336         0.344         0.344     
##   adj. R-squared            0.226         0.316         0.335         0.342         0.342     
##   sigma                     0.710         0.668         0.659         0.655         0.655     
##   F                       468.267       370.379       268.912       208.768       166.962     
##   p                         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood        -1721.057     -1621.814     -1599.384     -1589.835     -1589.749     
##   Deviance                805.870       711.796       692.105       683.887       683.814     
##   AIC                    3448.114      3251.628      3208.768      3191.669      3193.499     
##   BIC                    3464.245      3273.136      3235.654      3223.932      3231.138     
##   N                      1599          1599          1599          1599          1599         
## ==============================================================================================
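The comparison of the five models can be made explicit by tabulating the fit statistics reported above. This sketch copies the adjusted R-squared, AIC and BIC values from the table into a data frame (the name `fit_stats` is illustrative) and looks for the model with the lowest AIC.

```r
# Fit statistics copied from the regression table above
fit_stats <- data.frame(
  model  = c("m1", "m2", "m3", "m4", "m5"),
  adj_r2 = c(0.226, 0.316, 0.335, 0.342, 0.342),
  aic    = c(3448.114, 3251.628, 3208.768, 3191.669, 3193.499),
  bic    = c(3464.245, 3273.136, 3235.654, 3223.932, 3231.138)
)

# Difference to the best (lowest) AIC; 0 marks the preferred model
fit_stats$delta_aic <- fit_stats$aic - min(fit_stats$aic)

best_aic <- fit_stats$model[which.min(fit_stats$aic)]
best_bic <- fit_stats$model[which.min(fit_stats$bic)]

print(fit_stats)
print(c(best_aic = best_aic, best_bic = best_bic))
```

Both criteria favour m4: adding citric acid in m5 leaves the adjusted R-squared unchanged and raises AIC and BIC, which matches the non-significant citric acid coefficient in the table.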

> Prediction: The following input values have been used to predict the wine quality:
- Alcohol: 12%
- Volatile Acidity: 0.3 g/dm^3
- Sulphates: 0.75 g/dm^3
- Total Sulfur Dioxide: 25 mg / dm^3
- One standard deviation confidence interval

##        fit      lwr      upr
## 1 6.340534 5.688252 6.992816
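An interval of this kind can be produced with `predict()`. The code that generated the output above is not shown, so the following is only a sketch under stated assumptions: the data frame `sim` is simulated (not the wine data), its generating coefficients are the rounded m4 estimates from the table, and the call uses a prediction interval with `level = 0.68` as a stand-in for "one standard deviation".

```r
# ASSUMPTION: simulated inputs and noise, used only to make the sketch runnable
set.seed(42)
n <- 1599
sim <- data.frame(
  alcohol              = rnorm(n, mean = 10.4, sd = 1.1),
  volatile.acidity     = rnorm(n, mean = 0.53, sd = 0.18),
  sulphates            = rnorm(n, mean = 0.66, sd = 0.17),
  total.sulfur.dioxide = rlnorm(n, meanlog = log(38), sdlog = 0.7)
)
sim$quality <- 2.826 + 0.295 * sim$alcohol - 1.199 * sim$volatile.acidity +
  0.712 * sim$sulphates - 0.002 * sim$total.sulfur.dioxide +
  rnorm(n, mean = 0, sd = 0.655)

# Same formula as m4 in the table above
fit <- lm(quality ~ alcohol + volatile.acidity + sulphates +
            total.sulfur.dioxide, data = sim)

# Input values listed in the prediction section
new_wine <- data.frame(
  alcohol = 12, volatile.acidity = 0.3,
  sulphates = 0.75, total.sulfur.dioxide = 25
)

# Prediction interval covering roughly one residual standard deviation
pred <- predict(fit, newdata = new_wine, interval = "prediction", level = 0.68)
print(pred)
```

The result is a one-row matrix with the columns `fit`, `lwr` and `upr`. Because the data are simulated, the numbers will not reproduce the output above; with `interval = "confidence"` the band would be much narrower, since it describes the uncertainty of the mean prediction rather than of a single wine.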

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The multivariate plots helped to understand how good wines differ from the other wines in the dataset in terms of their input values. Some plots showed clear patterns, e.g. ‘Wine quality by alcohol and volatile acidity’, ‘Wine quality by alcohol and sulphates’, ‘Wine quality by volatile acidity and sulphates’ and ‘Wine quality by density and alcohol’. If we wanted to select features for further machine learning work, these visualizations would help us understand the data and choose the most useful ones.

Were there any interesting or surprising interactions between features?

For me, the surprising plots were the ‘Wine quality by density and fixed acidity’ and ‘Wine quality by density and alcohol’. I did not expect a strong relationship between the input variables and was even more surprised by the clarity of patterns in the data.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes, I created a prediction model. I selected quality as the output variable and the variables that are correlated with the wine quality as inputs to a linear regression model. I fitted five models, but the last one did not improve the R-squared. The maximum R-squared is 0.344, meaning the model explains only about a third of the variance in quality, so its predictions are not very precise. Still, we can use the model to get a feeling for the range (e.g. a one standard deviation interval) in which the quality of a wine might lie, given the input variables alcohol, volatile acidity, sulphates and total sulfur dioxide.


Final Plots and Summary

Plot One

Description One

These bivariate plots illustrate the correlation between the alcohol strength and the wine quality. Both the linear regression line on the left and the boxplots on the right show that these variables are positively correlated. This means that wines with a higher alcohol strength tend to have a higher quality.

Plot Two

Description Two

This multivariate plot examines how the quality is impacted by the alcohol strength and the sulphates input. This plot shows that both alcohol and sulphates are positively correlated with the wine quality. This results in a higher density of good wines in the upper right corner.

Plot Three

Description Three

This plot shows a strong positive correlation between fixed acidity and density. For a given fixed acidity, it makes sense to aim for a low density. Conversely, for a given density, more acidic wines tend to receive better ratings.


Reflection

In this project we explored a dataset of 1599 red wines. We looked at the variables themselves and at how they are correlated with each other. We paid particular attention to the input variables that are most strongly correlated with the wine quality. In the bivariate plots section we illustrated these correlations, and in the multivariate plots section we combined them. My main struggle in the beginning was that most wines are rated in a very narrow range of 5-6. This made it difficult to create effective multivariate plots. We solved this problem by grouping the integer quality values into bad, middle and good wines.

This dataset might also be interesting for machine learning applications. My regression model was based on a very simple linear relationship between the variables. I expect that the prediction performance could be improved by selecting the best features and using more sophisticated algorithms, e.g. decision trees or support vector regression. The multivariate plots suggest that promising decision surfaces can be built from the input variables.